Ideas

• Needle in haystack problem

- Sampling data does not work (may not sample
the entire needle)

• Outline 

- Problem 

- Approach 

• Supervised, unsupervised, semisupervised 

• New similarity measures 

• Kernel methods 

• PC 

• MDS with kernels 
Problem Introduction

NASA programs have large numbers (and types) of 
problem reports. 

• ISS PRACA: 3000+ records, 1-4 pages each; 

• ISS SCR: 28,000+ records, 1-4 pages each; 

• Shuttle CARS: 7000+ records, 1-4 pages each; 

• ASRS: 27000+ records, 1 paragraph each 

These free-text reports are written by many different
people, so the emphasis and wording vary considerably.

With so much data to sift through, analysts (subject
experts) need help identifying possible safety issues or
concerns and confirming that they haven’t
missed important problems.

• Unsupervised clustering is the initial step to accomplish this;

• We think we can go much farther: specifically, identify possible
recurring anomalies.

• Recurring anomalies may be indicators of larger systemic problems. 


Text Mining Solution - ReADS 

Recurring Anomaly Discovery System 
(ReADS): 

• The Recurring Anomaly Detection System 
(ReADS) is an integrated secure online tool to 
analyze text reports, such as aviation reports 
and maintenance records. 

- Text clustering algorithms group large quantities of 
reports and documents. 

• Reduces human error & fatigue 

- Automates the discovery of unknown recurring 
anomalies; 

- Identifies interconnected reports; 

- Provides a visualization of the clusters and recurring 
anomalies 




Recurring Anomaly “Fingerprints”

• Recurrent failures

• Problems that cross traditional system boundaries so
failure effects are not fully recognized

• Evidence of unconfirmed or random failures

• Problems that have been accepted by repeated waivers

• Discrepant conditions repeatedly accepted by routine
analysis

• Problems that are the focus of alternative opinions within
the engineering community


ReADS Text Mining Algorithms 

Unsupervised Clustering: 

Spherical k-means -> modified von Mises Fisher

Recurring Anomaly Identification:

1. Identify reports which mention other reports as a
recurring anomaly;

2. Detect recurring anomalies:

a. find the similarity between documents using the cosine
similarity measure,

b. then, according to the similarity measure, run the hierarchical
clustering algorithm to cluster the recurring anomalies.




Similarity between Reports 


Cosine Similarity Measure 


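Calculate the inner product of the normalized term frequency vectors:

$$R(d_i, d_j) = \cos(d_i, d_j) = \frac{\sum_t w_t(d_i)\, w_t(d_j)}{\lVert d_i \rVert \, \lVert d_j \rVert}$$

where $w_t(d)$ is the frequency of term $t$ in document $d$.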
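As a minimal sketch of the cosine similarity measure, the snippet below compares three made-up report snippets. The whitespace tokenizer and the helper names `term_frequencies` and `cosine_similarity` are assumptions for illustration, not part of ReADS.

```python
import math
from collections import Counter

def term_frequencies(text):
    """Count lowercase whitespace-separated tokens (a deliberately naive tokenizer)."""
    return Counter(text.lower().split())

def cosine_similarity(tf_a, tf_b):
    """Inner product of two term-frequency vectors divided by their norms."""
    dot = sum(count * tf_b.get(term, 0) for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical report snippets, invented for this example.
report_1 = term_frequencies("valve leak detected on pump")
report_2 = term_frequencies("pump valve leak")
report_3 = term_frequencies("software timeout during upload")

print(cosine_similarity(report_1, report_2))  # shares three terms
print(cosine_similarity(report_1, report_3))  # shares no terms
```

Reports that share vocabulary score close to 1 and reports with no terms in common score 0, regardless of report length.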


Hierarchical Clustering of Recurring 

Anomalies 

• After calculating the distance between each pair of documents,
the algorithm applies single linkage, i.e., nearest
neighbor, to create a hierarchical tree representing
connections between documents.

- Also generates an ‘inconsistency coefficient’, which is a measure
of the relative consistency of each link in the tree.

• The hierarchical tree is partitioned into clusters by setting
a threshold on the inconsistency coefficient.

- A high inconsistency coefficient threshold implies that the reports could be
very different and still be sorted into the same cluster.

• Currently the inconsistency coefficient threshold is set 
very low, which returns many smaller clusters of very 
similar reports. 
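The single-linkage step can be sketched as follows. This is a simplification under a stated assumption: the tree is cut with a plain distance cutoff rather than the inconsistency coefficient described above. Cutting a single-linkage tree at a given height is equivalent to taking connected components of the graph that links documents closer than the cutoff, which is what the hypothetical helper `single_linkage_clusters` does.

```python
def single_linkage_clusters(distance, cutoff):
    """Cut a single-linkage tree at `cutoff`.

    `distance` is a symmetric matrix (list of lists). Two documents end up in
    the same cluster when a chain of pairwise distances <= cutoff joins them.
    NOTE: a distance cutoff stands in for the inconsistency coefficient here.
    """
    n = len(distance)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if distance[i][j] <= cutoff:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Made-up distances (for example 1 - cosine similarity) between four reports.
example_distance = [
    [0.0, 0.1, 0.5, 0.9],
    [0.1, 0.0, 0.45, 0.9],
    [0.5, 0.45, 0.0, 0.8],
    [0.9, 0.9, 0.8, 0.0],
]
print(single_linkage_clusters(example_distance, 0.2))
print(single_linkage_clusters(example_distance, 0.5))
```

A low cutoff returns many small clusters of very similar reports, mirroring the behavior described for a low inconsistency threshold.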





ReADS System & Visualization

• Online search & text mining system

[Figure: screenshot of the ECS Mishap and Anomaly Information System, labeled "Features"]

Sample Recurring Anomalies




ReADS visualization shows 
documents as boxes. Connections 
between reports are shown by solid 
lines and arrows. 


Intro 


To quantify any improvement that Natural Language Processing (NLP) and text
normalization bring to text classification with Support Vector Machines (SVM) and
Naive Bayes, we directly compared classification rates on:

(1) documents processed using an NLP tool and a text normalization tool, PLADS, and

(2) the same documents with no preprocessing.

Specifically, we:

• Measured the difference in Precision, Recall, and F-Measure

• Applied the comparison to the classification of 60 anomalies.

• Did not aim for an optimum classifier technique. Precision and Recall results for the different preprocessing methods were
compared. No work was done to improve either.
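For reference, Precision, Recall, and F-Measure for one anomaly class can be computed as in the sketch below. The function name and the toy label lists are illustrative assumptions; the labels are binary (1 means the report belongs to the anomaly class).

```python
def precision_recall_f(true_labels, predicted_labels):
    """Return (precision, recall, F-measure) for binary labels."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # correctly flagged
    fp = sum(1 for t, p in pairs if not t and p)    # flagged but wrong
    fn = sum(1 for t, p in pairs if t and not p)    # missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy example: 8 reports, 4 of which truly belong to the class.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 1, 1, 0]
print(precision_recall_f(truth, predicted))
```

Running the same computation on the raw and the preprocessed document sets, class by class, gives the per-anomaly differences that the charts in this comparison report.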

Dataset used: 

• Aviation Safety Reporting System (ASRS) 

• ASRS reports are classified by anomaly, into over 100 anomalies. Each
report may be classified in multiple anomaly classes.

• 30% are in only one anomaly class 

• 50% are in 3 anomaly classes 

• Documents are short, approximately 6 sentences 

• 27,596 documents 

• Training Dataset: 20,000 docs dedicated to training, 4000 selected 

• Test Dataset: 7,000 docs dedicated to testing, 2000 selected 

Tools used: 

• MATLAB used for preprocessing

• Weka used for SVM and Naive Bayes classification








Sample PLADS Term Reduction 


JUST PRIOR TO TOUCHDOWN, LAX TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT IN FRONT OF US. BOTH THE COPLT AND I,
HOWEVER, UNDERSTOOD TWR TO SAY, CLRED TO LAND, ACFT ON THE RWY. SINCE THE ACFT IN FRONT OF US WAS CLR OF THE
RWY AND WE BOTH MISUNDERSTOOD TWR'S RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED. AS WE TAXIED TO THE
GATE, TWR REQUESTED THAT I CALL THEM FROM A PHONE WHEN I HAD THE OPPORTUNITY (I CALLED FROM THE GATE). IT WAS ON THE PHONE
THAT I DISCOVERED TWR HAD SENT US AROUND. IN HINDSIGHT, FROM THEIR PERSPECTIVE, GOING AROUND WAS THE PRUDENT THING TO DO. I
HAVE BECOME TOO CONDITIONED IN THE PAST FEW YRS IN BEING VECTORED INTO A VISUAL APCH BEHIND AN ACFT THAT IS TOO CLOSE.
REGRETTABLY, IN THIS SIT, CONFUSION AND MISUNDERSTANDING PUT US IN A DIFFICULT SIT.


Expand Acronyms, Simplify Punctuation


JUST PRIOR TO TOUCHDOWN, LAX tower TOLD US TO GO AROUND BECAUSE OF THE aircraft IN FRONT OF US. BOTH THE copilot AND I,
HOWEVER, UNDERSTOOD tower TO SAY, clear TO LAND, aircraft ON THE runway, SINCE THE aircraft IN FRONT OF US WAS clear OF THE
runway AND WE BOTH misunderstand tower RADIO CALL AND CONSIDERED IT AN ADVISORY, WE LANDED. AS WE TAXIED TO THE GATE
tower REQUESTED THAT I CALL THEM FROM A PHONE WHEN I HAD THE OPPORTUNITY I CALLED FROM THE GATE. IT WAS ON THE PHONE THAT I
DISCOVERED tower HAD SENT US AROUND, IN HINDSIGHT, FROM THEIR PERSPECTIVE, GOING AROUND WAS THE PRUDENT THING
TO DO. I HAVE BECOME TOO CONDITIONED IN THE PAST FEW year IN BEING VECTORED INTO A VISUAL approach BEHIND AN aircraft THAT IS TOO
CLOSE, REGRETTABLY, IN THIS situation, CONFUSION AND MISUNDERSTANDING PUT US IN A DIFFICULT situation.


Stemming, Remove Non-informative Terms, Phrasing


PRIOR _ TOUCHDOWN _ tower TOLD goaround aircraft _ FRONT copilot understand tower _ SAY clear _ LAND aircraft _ _ runway
_ _ aircraft _ FRONT clear runway misunderstand tower RADIO CALL _ consider advise _ land taxiedto _ GATE tower request CALL
PHONE OPPORTUNITY _ call GATE PHONE discover tower _ SENT HINDSIGHT _ _ PERSPECTIVE go
prudentthing condition PAST _ year vector VISUAL approach _ _ aircraft CLOSE REGRETTABLY _ _ situate confuse _
misunderstand put difficultsituation
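The first and last stages of this example can be approximated with a small lookup-based sketch. This is not the PLADS implementation: the abbreviation table and stop-word list below are hypothetical, contain only a few entries taken from the sample narrative, and the sketch skips stemming and phrasing entirely.

```python
import re

# Hypothetical abbreviation table built from the sample narrative above.
ABBREVIATIONS = {
    "twr": "tower", "acft": "aircraft", "rwy": "runway", "coplt": "copilot",
    "clred": "clear", "clr": "clear", "yrs": "year", "apch": "approach",
    "sit": "situation",
}
# Hypothetical list of non-informative terms.
STOP_WORDS = {"us", "to", "because", "of", "the", "and", "i", "in", "a", "an"}

def normalize(text):
    """Lowercase, strip punctuation, expand abbreviations, drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    expanded = [ABBREVIATIONS.get(token, token) for token in tokens]
    return [token for token in expanded if token not in STOP_WORDS]

print(normalize("TWR TOLD US TO GO AROUND BECAUSE OF THE ACFT"))
```

Mapping variants such as TWR and tower to one token is what shrinks the vocabulary before the document-term matrix is built.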


Raw Text & PLADS Comparison 


In order to classify the 
documents, they are first 
formatted into a document-term 
frequency matrix. The cells of 
the matrix are the frequency 
count of the terms that appear in 
the document. 
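A document-term frequency matrix of this kind can be sketched in a few lines. The whitespace tokenizer, the function name `document_term_matrix`, and the three toy documents are assumptions for illustration only.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts term j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

# Toy documents, invented for this example.
docs = ["tower clear runway", "aircraft runway runway", "tower tower aircraft"]
vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one term, so reducing the vocabulary directly reduces the number of columns the classifier has to process.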






[Table: example document-term frequency matrix, rows Document 1 to Document 3, columns Term 1 to Term 4, cells holding term counts]

• PLADS reduced the total number of terms in 27000 documents
from 44940 to 31701

• PLADS reduced classification computation time by 0%-10%


















Comparison of Raw Text vs. PLADS
using SVM

[Figure: Difference Chart: SVM]

• All terms used, no additional term reduction applied

• PLADS improves precision 2% on average

• PLADS improves recall 2% on average


Comparison of Raw Text vs. PLADS
using Naive Bayes

• All terms used, no additional term reduction applied

• PLADS improves Naive Bayes precision 1% on average

• PLADS improves Naive Bayes recall 2% on average





Comparison of Raw Text vs. PLADS
with Terms Selection

[Figure: Difference Chart: SVM w/Term Selection]

• 1000 terms selected using Information Gain

• PLADS improves precision 2% on average

• PLADS improves recall 3% on average










Comparison of Raw Text vs. NLP with
Terms Selection

[Figure: Difference: SVM w/ NLP; x-axis: Anomaly]

• 500 terms selected using Information Gain

• NLP improves F-measure 3% on average
Text Categorization - Applications

• Automated sorting of scientific articles according
to predefined thesauri of technical words.

• Filing patents into patent directories

• Selective dissemination of information to
consumers

• Automated population of hierarchical catalogues
of web resources

• Spam filtering

• Identification of document genre

• Authorship attribution

• Automated Essay Grading
Tool Kit

• Involves the synergy of Information
Retrieval (IR) Technology and Machine
Learning (ML) Technology

• Support Vector Machines

• Neural Networks

• Boosting Algorithms

• Latent Semantic Analysis

Natural Language Processing (NLP) can
be used to integrate morphological,
syntactic and semantic analysis with the
process of clustering documents.
TF-IDF Classifier

• Uses the Bag of Words representation.

• Each document is represented as a row.

• Cells of the matrix are the TF-IDF (Term Frequency Inverse Document
Frequency) values for the corresponding word and document.

• Term Frequency TF(w, d): Frequency of occurrence of word w in
the document d.

• |D|: Total number of documents.

• DF(w): Total number of documents containing word w.

$$IDF(w) = \log\left(\frac{|D|}{DF(w)}\right)$$

$$d^{(i)} = TF(w_i, d) \cdot IDF(w_i)$$

TFIDF Classifier

• Prototype vectors $\vec{c}$ are generated for each class.

• Decision is taken by measuring the cosine
of the angle between the prototype vector
and the data vector:

$$H_{TFIDF}(d') = \arg\max_{C \in \mathcal{C}} \cos(\vec{c}, \vec{d'})$$
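The prototype-based decision rule can be sketched as below. One assumption is labeled loudly: the slide's prototype formula did not survive, so each class prototype is taken here to be the sum of the TF-IDF vectors of that class's training documents. All function names and the four toy documents are illustrative.

```python
import math
from collections import Counter

def idf_weights(documents):
    """IDF(w) = log(|D| / DF(w)) over a list of tokenized documents."""
    n_docs = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tfidf_vector(doc, idf):
    """Sparse vector with entries TF(w, d) * IDF(w); unseen words get weight 0."""
    return {term: count * idf.get(term, 0.0) for term, count in Counter(doc).items()}

def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def train_prototypes(documents, labels, idf):
    """ASSUMPTION: prototype = sum of the class's TF-IDF document vectors."""
    prototypes = {}
    for doc, label in zip(documents, labels):
        prototype = prototypes.setdefault(label, Counter())
        for term, weight in tfidf_vector(doc, idf).items():
            prototype[term] += weight
    return prototypes

def classify(doc, prototypes, idf):
    """Pick the class whose prototype has the largest cosine with the document."""
    vector = tfidf_vector(doc, idf)
    return max(prototypes, key=lambda label: cosine(prototypes[label], vector))

train_docs = [["runway", "incursion", "tower"], ["runway", "landing", "tower"],
              ["engine", "fire", "smoke"], ["engine", "smoke", "cabin"]]
train_labels = ["runway", "runway", "engine", "engine"]
idf = idf_weights(train_docs)
prototypes = train_prototypes(train_docs, train_labels, idf)
print(classify(["tower", "runway", "clear"], prototypes, idf))
```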











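Naive Bayes Classifier

$$H_{BAYES}(d') = \arg\max_{C \in \mathcal{C}} \Pr(C \mid d')$$

This is calculated by Bayes rule:

$$\Pr(C \mid d') = \frac{\Pr(d' \mid C) \cdot \Pr(C)}{\sum_{C' \in \mathcal{C}} \Pr(d' \mid C') \cdot \Pr(C')}$$

$$\Pr(d' \mid C) = \prod_{i=1}^{|d'|} \Pr(w_i \mid C)$$

Underlying Assumptions

1. We have $|\mathcal{C}|$ probability distributions.

2. Each document is generated from the p.d.f.
associated with that particular class. The i-th
word of the document is generated from the i-th
independent trial.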
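A compact sketch of this decision rule follows. It works in log space to avoid underflow and adds Laplace (add-one) smoothing so that unseen words do not zero out a class; the smoothing is an assumption added for the example, not something stated on the slide. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for doc, label in zip(documents, labels):
        term_counts[label].update(doc)
        vocabulary.update(doc)
    return class_counts, term_counts, vocabulary

def classify_naive_bayes(doc, model):
    """Return argmax over classes of log Pr(C) + sum_i log Pr(w_i | C)."""
    class_counts, term_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior Pr(C)
        for term in doc:
            # Laplace-smoothed estimate of Pr(w_i | C) (assumption).
            p = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

nb_docs = [["runway", "incursion", "tower"], ["runway", "landing", "tower"],
           ["engine", "fire", "smoke"], ["engine", "smoke", "cabin"]]
nb_labels = ["runway", "runway", "engine", "engine"]
nb_model = train_naive_bayes(nb_docs, nb_labels)
print(classify_naive_bayes(["runway", "tower"], nb_model))
```

The denominator of Bayes rule is the same for every class, so the sketch compares numerators only.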



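PrTFIDF

Assumptions:

- Each document has a representation x.

- These representations are not unique. A function θ
maps a document to its representation.

$$H_{PrTFIDF}(d') = \arg\max_{C \in \mathcal{C}} \Pr(C \mid d', \theta)$$

$$\Pr(C \mid d', \theta) = \sum_{x} \Pr(C \mid x) \cdot \Pr(x \mid d', \theta)$$

$$\Pr(C \mid x) = \frac{\Pr(x \mid C) \cdot \Pr(C)}{\sum_{C' \in \mathcal{C}} \Pr(x \mid C') \cdot \Pr(C')}$$

where Pr(C) is the prior probability.

Function θ is a design choice. Let's say x = w and
$\Pr(x \mid d', \theta) = \Pr(w \mid d', \theta) = TF(w, d') / |d'|$.

So documents are represented by single words. They do
not have one fixed representation.

Resulting Rule:

$$H_{PrTFIDF}(d') = \arg\max_{C \in \mathcal{C}} \sum_{w} \frac{\Pr(w \mid C) \cdot \Pr(C)}{\sum_{C' \in \mathcal{C}} \Pr(w \mid C') \cdot \Pr(C')} \cdot \Pr(w \mid d', \theta)$$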
Assumptions to show the equivalence of PrTFIDF
and TFIDF:

- Equal prior probabilities.

- There is a λ such that:

$$\lambda \cdot \sum_{w'} TF(w', C) = \sqrt{\sum_{w'} \left(TF(w', C) \cdot IDF(w')\right)^2}$$

Define IDF'(w) as: [definition in terms of TF(w, C) not recoverable]

Under these assumptions it can be shown that $H_{PrTFIDF}(d')$ and $H_{TFIDF}(d')$ give the same
classification for data.


Clustering of Directional Data

Two Broad Kinds of Algorithms

1. Generative (parametric)

Examples: Mixture of Gaussians;
Mixture of VMF distributions

Model using the exponential family

2. Discriminative (non-parametric)

K-means - measures the Euclidean distance

spK-means - measures cosine similarity

fsK-means - frequency sensitive
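To make the difference between K-means and spK-means concrete, here is a minimal spherical k-means sketch. It assigns each unit-normalized vector to the centroid with the largest cosine similarity and renormalizes each centroid after averaging. The function names, the fixed iteration count, and the caller-supplied initial centroids are assumptions; the sketch also assumes non-zero, non-negative input vectors (as with tf-idf), so no centroid sum collapses to zero.

```python
import math

def normalize(v):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def spherical_kmeans(vectors, initial_centroids, iterations=20):
    """Cluster by cosine similarity; returns (assignment, centroids)."""
    points = [normalize(v) for v in vectors]
    centroids = [normalize(c) for c in initial_centroids]
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine (dot product of unit vectors).
        assignment = [
            max(range(len(centroids)), key=lambda h: dot(p, centroids[h]))
            for p in points
        ]
        # Update step: sum the members and project back onto the unit sphere.
        for h in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == h]
            if members:
                centroids[h] = normalize([sum(col) for col in zip(*members)])
    return assignment, centroids

toy_vectors = [[1.0, 0.1], [10.0, 2.0], [0.2, 1.0], [1.0, 8.0]]
assignment, centroids = spherical_kmeans(toy_vectors, [[1.0, 0.0], [0.0, 1.0]])
print(assignment)
```

Note that `[1.0, 0.1]` and `[10.0, 2.0]` land in the same cluster even though they are far apart in Euclidean distance, because only direction matters.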
Brief Look at EM

• We have M data points $y_1, \ldots, y_M$ that we want to fit using a
mixture of K univariate Gaussian distributions with
identical and known variance.

• Problem: We don't know which data point was generated
using which of the distributions. Represent data points as

$$(y_m, w_{m1}, \ldots, w_{mK})$$

where $w_{mk} = 1$ if $y_m$ was generated using distribution k,
otherwise 0.

The ML solution is given by:

$$\mu_k = \frac{\sum_{m=1}^{M} w_{mk}\, y_m}{\sum_{m=1}^{M} w_{mk}}$$

where m = 1, ..., M and k = 1, ..., K.

But the $w_{mk}$ are not known. So we know neither the $w_{mk}$ nor the
$\mu_k$. The idea of EM is to estimate both simultaneously.
This corresponds to clustering data points by minimizing
the Euclidean distances in the k-means algorithm.

Maximization:

Using the expected values of $w_{mk}$, the ML estimates of $\mu_k$
are calculated. This corresponds to updating the k means
at every iteration of the k-means algorithm.
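The two EM steps for this setting can be sketched as follows, assuming univariate Gaussians with a shared, known variance and (an added assumption) equal mixing proportions. The function name `em_gaussian_means` and the toy data are illustrative; the sketch also assumes every point is close enough to some mean that the weights do not underflow to zero.

```python
import math

def em_gaussian_means(data, initial_means, variance=1.0, iterations=50):
    """Estimate component means with EM; variance is known and shared."""
    means = list(initial_means)
    for _ in range(iterations):
        # E-step: expected value of w_mk (soft assignment of point m to component k).
        responsibilities = []
        for y in data:
            weights = [math.exp(-(y - mu) ** 2 / (2 * variance)) for mu in means]
            total = sum(weights)
            responsibilities.append([w / total for w in weights])
        # M-step: weighted ML estimate of each mean.
        for k in range(len(means)):
            weight_sum = sum(r[k] for r in responsibilities)
            means[k] = sum(r[k] * y for r, y in zip(responsibilities, data)) / weight_sum
    return means

toy_data = [-2.1, -1.9, -2.0, 1.9, 2.0, 2.1]
estimated_means = em_gaussian_means(toy_data, [-1.0, 1.0])
print(estimated_means)
```

Replacing the soft responsibilities with a hard 0/1 assignment to the nearest mean turns this loop into ordinary k-means, which is the correspondence drawn above.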



The EM algorithm is also used on a mixture of VMF distributions.

- Introduced by von Mises to study the
deviations of measured atomic weights from
integral values.

- Its importance in statistical inference on a
circle is almost the same as that of the normal
distribution on a line.





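Von Mises Fisher Distribution

• A circular random variable θ is said to follow a
von Mises distribution if its p.d.f. is given by:

$$g(\theta; \mu_0, \kappa) = \frac{1}{2\pi I_0(\kappa)}\, e^{\kappa \cos(\theta - \mu_0)}, \qquad 0 \le \theta < 2\pi,\ \kappa > 0,\ 0 \le \mu_0 < 2\pi$$

where $I_0(\kappa)$ is the modified Bessel function of the first kind and order 0.

The parameter $\mu_0$ is the mean direction, while
the parameter κ is described as the
concentration parameter.

• A d-dimensional unit random vector x with
$\lVert x \rVert = 1$ is said to have a d-variate VMF distribution
if its p.d.f. is given by:

$$f(x \mid \mu, \kappa) = c_d(\kappa)\, e^{\kappa \mu^T x}$$

where $\lVert \mu \rVert = 1$ and $\kappa \ge 0$ and

$$c_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)}$$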
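As a numerical sanity check on the circular case, the sketch below evaluates the von Mises density, exp(κ cos(θ − μ0)) / (2π I0(κ)), using a power-series approximation of the order-0 modified Bessel function. The function names and the 50-term truncation are assumptions chosen for moderate κ; a production implementation would use a numerical library instead.

```python
import math

def bessel_i0(x, terms=50):
    """Modified Bessel function of the first kind, order 0, via its power series.

    I0(x) = sum over k of ((x / 2) ** (2 * k)) / (k!) ** 2
    """
    total = 1.0
    term = 1.0
    for k in range(1, terms):
        term *= (x / 2.0) ** 2 / (k * k)
        total += term
    return total

def von_mises_pdf(theta, mu0, kappa):
    """Density of the von Mises distribution on the circle."""
    return math.exp(kappa * math.cos(theta - mu0)) / (2 * math.pi * bessel_i0(kappa))

# Riemann sum over one full turn; it should be very close to 1.
n = 2000
integral = sum(von_mises_pdf(2 * math.pi * i / n, 1.0, 2.0) for i in range(n)) * 2 * math.pi / n
print(integral)
print(von_mises_pdf(1.0, 1.0, 2.0), von_mises_pdf(1.0 + math.pi, 1.0, 2.0))
```

The density peaks at the mean direction and is lowest at the opposite point of the circle; larger κ makes the peak sharper.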
Why is the text data Directional?

Preprocessing step before applying the
algorithms to text data: the (tf-idf) document
vectors are L2 normalized to make them unit
norm.

Assumption: Direction of documents is sufficient
to get good clusters.

E.g., two documents, one small, one lengthy,
on the same topic will have the same direction
and hence be put in the same cluster.

This unit normalized data lives on a
$(d-1)$-dimensional sphere.
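The short-versus-long document argument can be checked numerically. In this sketch the two vectors are made-up term weights where the long document is simply ten times the short one; `l2_normalize` is an illustrative helper name.

```python
import math

def l2_normalize(vector):
    """Divide a non-zero vector by its L2 norm so that it has unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

short_doc = [1.0, 2.0, 0.0]      # few occurrences of each term
long_doc = [10.0, 20.0, 0.0]     # same topic mix, ten times longer

unit_short = l2_normalize(short_doc)
unit_long = l2_normalize(long_doc)
print(unit_short)
print(unit_long)
```

After normalization both documents map to the same point on the unit sphere, so length no longer separates them.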


[Figure: Density of the von Mises distribution for κ]
Analogies to the Normal
Distribution

1. For large κ the random variable θ is approximately distributed as

$$N(\mu_0, 1/\kappa)$$

(proof in handout!).

2. Relation to Bivariate Normal Distribution:

Let x and y be independent normal variables with
means $(\cos\mu_0, \sin\mu_0)$ and equal variances $1/\kappa$.
The p.d.f. of the polar variables $(r, \theta)$ is:

$$p(r, \theta) = \frac{\kappa r}{2\pi} \exp\left\{-\frac{\kappa}{2}\left(r^2 - 2r\cos(\theta - \mu_0) + 1\right)\right\}$$

The conditional distribution of θ for r = 1 is the
VMF($\mu_0$, κ).

These clearly indicate that $\mu_0$ behaves like the mean
while $1/\kappa$ influences the distribution in the same
way as $\sigma^2$ influences the normal distribution.
3. The Maximum Likelihood Characterization:

For a p.d.f. on the line, f(x - μ), the maximum likelihood
estimate for the mean is given by the sample mean if and only if
f is the normal distribution.

Likewise, for a p.d.f. on the circle, $g(\theta - \mu_0)$, the maximum likelihood estimate for
the mean direction is the sample mean direction $\bar{x}_0$ if and only if g is a VMF.

Proof:

According to the log likelihood function, if the ML estimate of $\mu_0$ is
the sample mean direction $\bar{x}_0$, then:

$$\sum_{i=1}^{n} \frac{g'(\theta_i - \bar{x}_0)}{g(\theta_i - \bar{x}_0)} = 0$$

By definition of $\bar{x}_0$:

$$\sum_{i=1}^{n} \sin(\theta_i - \bar{x}_0) = 0$$

Since the above two equations are identical for each n, we have

$$\frac{g'(\theta_i - \bar{x}_0)}{g(\theta_i - \bar{x}_0)} = \text{const} \cdot \sin(\theta_i - \bar{x}_0)$$

Replacing $\bar{x}_0$ by $\mu_0$, g(θ) is the VMF pdf.

4. Maximum Entropy Characterization:

Given a fixed mean and variance, the Gaussian is
the distribution that maximizes the entropy.

Likewise, given a fixed circular variance ρ and
mean direction $\mu_0$, the VMF distribution
maximizes the entropy.

Proof:

• Given in Handout







to model Directional Data


Unfortunately there is no distribution for
directional data which has all properties
analogous to the linear normal
distribution. The VMF has some but not all
of these desirable properties.

The wrapped normal distribution is a
strong contender to VMF.

But the VMF provides:

- simpler ML estimates.

- tractable distribution in
hypothesis testing.
VMF implemented Using the EM
framework

Framework:

The probability density of the moVMF
generative model is given by:

$$f(x \mid \Theta) = \sum_{h=1}^{k} \alpha_h f_h(x \mid \theta_h)$$

Framework - contd.

• Let $X = \{x_1, \ldots, x_n\}$ be generated by sampling
independently from this generative model.

• Let $Z = \{z_1, \ldots, z_n\}$ be the corresponding set of so-called
hidden variables.

• $z_i = h$ if $x_i$ was generated following $f_h(x \mid \theta_h)$.

• With the knowledge of the hidden variables,
the log-likelihood is given by

$$\ln P(X, Z \mid \Theta) = \sum_{i=1}^{n} \ln\left(\alpha_{z_i} f_{z_i}(x_i \mid \theta_{z_i})\right)$$

Hidden variables are unknown. The above equation
is a random variable dependent on the
distribution of Z. This is the complete log
likelihood function.











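$$E_{p(Z \mid X, \Theta)}\left[\ln P(X, Z \mid \Theta)\right] = \sum_{h=1}^{k} \sum_{i=1}^{n} (\ln \alpha_h)\, p(h \mid x_i, \Theta) + \sum_{h=1}^{k} \sum_{i=1}^{n} \left(\ln f_h(x_i \mid \theta_h)\right) p(h \mid x_i, \Theta)$$

The expression for $\alpha_h$ is found by the method of Lagrangian
multipliers with $\sum_{h} \alpha_h = 1$.

Again the $\theta_h$ are estimated with the condition $\lVert \mu_h \rVert = 1$ and $\kappa_h \ge 0$.

Estimation

Given (X, Θ) we estimate the
conditional distribution of Z | (X, Θ).

• Soft moVMF: Distribution of the
hidden variable is given by:

$$p(h \mid x_i, \Theta) = \frac{\alpha_h f_h(x_i \mid \theta_h)}{\sum_{l=1}^{k} \alpha_l f_l(x_i \mid \theta_l)}$$

• Hard moVMF: Distribution is given by

$$q(h \mid x_i, \Theta) = \begin{cases} 1, & \text{if } h = \arg\max_{h'} p(h' \mid x_i, \Theta) \\ 0, & \text{otherwise} \end{cases}$$

Algorithm 1 soft-movMF

[Algorithm listing: the E-step and M-step are repeated until convergence; the recoverable update terms are $\sum_{i=1}^{n} x_i\, p(h \mid x_i, \Theta)$ and $\sum_{i=1}^{n} p(h \mid x_i, \Theta)$]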


[Figure: Frequency Distribution of Human Clustered Reports]

[Figure: Frequency Distribution of k-means Clustered Reports]

[Figure: Frequency Distribution of von Mises Fisher Clustered Reports; x-axis: Category]











Sammon Mapping 


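• Project high dimensional data onto a two or
three dimensional space.

• A set of N data points $z_i$ are embedded in an L-dimensional space.

• A new set of N data points is generated such
that the following is minimized:

$$E = \frac{1}{\sum_{i<j} d^{*}_{ij}} \sum_{i<j} \frac{\left(d^{*}_{ij} - d_{ij}\right)^2}{d^{*}_{ij}}$$

where $d^{*}_{ij}$ is the distance between points i and j in the original space and $d_{ij}$ is the distance between their projections.

d measures the Euclidean distance between real vectors.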
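The quantity that Sammon mapping minimizes can be evaluated directly. The sketch below assumes the standard Sammon stress: squared differences between original and projected pairwise Euclidean distances, each divided by the original distance, and the total divided by the sum of original distances. It only scores a given projection and does not perform the minimization; the function names are illustrative, and it assumes no two original points coincide.

```python
import math
from itertools import combinations

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def sammon_stress(high_dim, low_dim):
    """Sammon stress of a projection; 0 means all pairwise distances are preserved."""
    pairs = list(combinations(range(len(high_dim)), 2))
    original = [euclidean(high_dim[i], high_dim[j]) for i, j in pairs]
    projected = [euclidean(low_dim[i], low_dim[j]) for i, j in pairs]
    scale = sum(original)
    return sum((d_star - d) ** 2 / d_star
               for d_star, d in zip(original, projected)) / scale

# Three points that already lie in a plane: dropping the last coordinate is lossless.
points_3d = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
good_projection = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
collapsed_projection = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
print(sammon_stress(points_3d, good_projection))
print(sammon_stress(points_3d, collapsed_projection))
```

Dividing by the original distance weights small distances more heavily, so local neighborhoods are preserved preferentially.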











A projection of 500 dimension document vectors 
into two dimensions using Sammon Maps[2] 


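Embed document vectors in a possibly
high dimensional space using Mercer
Kernels.

Mercer kernel -- Measure of similarity

• Gaussian Kernel:

$$K(z_i, z_j) = \exp\left(-\frac{\lVert z_i - z_j \rVert^2}{2\sigma^2}\right)$$

• Polynomial Kernel:

$$K(z_i, z_j) = (z_i^T z_j)^p = \langle z_i, z_j \rangle^p$$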




Spectral Clustering Contd 


Kernel Matrix: 

The (i,j) th entry corresponds to 
the similarity between documents i 
and j as measured by the kernel 
function. 
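A kernel matrix of this kind can be built as in the sketch below. The Gaussian kernel is written with the common `exp(-||u - v||^2 / (2 * sigma^2))` parametrization, which is an assumption since the slide's exact form is not recoverable; the function names and toy vectors are illustrative.

```python
import math

def gaussian_kernel(u, v, sigma=1.0):
    """exp(-||u - v||^2 / (2 sigma^2)); the parametrization is an assumption."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2 * sigma ** 2))

def polynomial_kernel(u, v, p=2):
    """Inner product raised to the power p."""
    return sum(a * b for a, b in zip(u, v)) ** p

def kernel_matrix(vectors, kernel):
    """Entry (i, j) is the kernel similarity between vectors i and j."""
    return [[kernel(u, v) for v in vectors] for u in vectors]

toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_poly = kernel_matrix(toy_docs, polynomial_kernel)
k_gauss = kernel_matrix(toy_docs, gaussian_kernel)
for row in k_poly:
    print(row)
```

Spectral clustering then works from this matrix alone, so any similarity that yields a valid Mercer kernel can be swapped in without touching the clustering step.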


[Figure: plot with x-axis "Cluster Number"]
System Architecture - An Example


Objectives 


A streamlined and efficient method 
for analyzing problem reports. 


Enhance clustering of problem 
reports to discover recurring 
anomalies 





System Model: 

• An engineering model that defines how
parts, components and subsystems interact


Relational Database:

• Consists of tables for all the parts,
subsystems and components.


Relational Database Framework

[Figure: relational schema diagram; recoverable field names: PROBLEM_ID, COMPONENT_ID, DESCRIPTION, SUBSYSTEM_ID, PART_ID; recoverable table name: COMPONENTS]